
Algorithm 3 Progressive Optimization with Center Loss

Input: The training dataset; the full-precision kernels C; the pre-trained kernels tC from ternary PCNNs; the projection matrix W; the learning rates $\eta_1$ and $\eta_2$.
Output: The binary PCNNs based on the updated C and W.
1: Initialize W randomly but C from tC;
2: repeat
3:     // Forward propagation
4:     for l = 1 to L do
5:         $\hat{C}^l_{i,j} \leftarrow P(W, C^l_i)$; // using Eq. 3.43
6:         $D^l_i \leftarrow \mathrm{Concatenate}(\hat{C}_{i,j})$; // using Eq. 3.45
7:         Perform activation binarization; // using the sign function
8:         Traditional 2D convolution; // using Eqs. 3.46, 3.47, and 3.48
9:     end for
10:    Calculate the cross-entropy loss $L_S$;
11:    if using center loss then
12:        $L = L_S + L_C$;
13:    else
14:        $L = L_S$;
15:    end if
16:    // Backward propagation
17:    Compute $\delta_{\hat{C}^l_{i,j}} = \frac{\partial L}{\partial \hat{C}^l_{i,j}}$;
18:    for l = L to 1 do
19:        // Calculate the gradients
20:        Calculate $\delta_{C^l_i}$; // using Eqs. 3.49, 3.51, and 3.52
21:        Calculate $\delta_{W^l_j}$; // using Eqs. 3.115, 3.116, and 3.56
22:        // Update the parameters
23:        $C^l_i \leftarrow C^l_i - \eta_1 \delta_{C^l_i}$; // using Eq. 3.50
24:        $W^l_j \leftarrow W^l_j - \eta_2 \delta_{W^l_j}$; // using Eq. 3.54
25:    end for
26:    Adjust the learning rates $\eta_1$ and $\eta_2$.
27: until the network converges
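
To make the data flow of Algorithm 3 concrete, the following PyTorch-style sketch mirrors its main steps under simplifying assumptions: the projection P(W, C) of Eq. 3.43 is replaced by a hypothetical per-kernel scaling followed by the sign function with a straight-through estimator, the concatenation of Eq. 3.45 and the center loss are omitted, and the layer sizes, learning rates, and helper names (SignSTE, project, BinaryConv) are illustrative rather than the book's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    # Sign function with a straight-through estimator so gradients can pass
    # the discretization step (a common stand-in for the discrete
    # back-propagation rules referenced as Eqs. 3.49-3.52).
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # clip gradients outside [-1, 1]

def project(W, C):
    # Hypothetical stand-in for P(W, C) in Eq. 3.43: scale the full-precision
    # kernels by a positive projection matrix, then discretize with sign.
    return SignSTE.apply(W.abs() * C)

class BinaryConv(nn.Module):
    # One binarized layer: sign-binarized activations convolved with projected kernels.
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.C = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, k, k))  # full-precision kernels C
        self.W = nn.Parameter(torch.ones(out_ch, 1, 1, 1))             # projection matrix W (per kernel)

    def forward(self, x):
        x = SignSTE.apply(x)               # activation binarization (line 7)
        D = project(self.W, self.C)        # projected binary kernels (lines 5-6)
        return F.conv2d(x, D, padding=1)   # traditional 2D convolution (line 8)

model = nn.Sequential(
    BinaryConv(3, 16), nn.BatchNorm2d(16), BinaryConv(16, 16),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

# Two learning rates, eta1 for the kernels C and eta2 for the projection
# matrices W, as in lines 23-24; the values here are placeholders.
eta1, eta2 = 1e-2, 1e-3
opt = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters() if n.endswith(".C")], "lr": eta1},
    {"params": [p for n, p in model.named_parameters() if n.endswith(".W")], "lr": eta2},
    {"params": [p for n, p in model.named_parameters() if not n.endswith((".C", ".W"))]},
], lr=1e-2, momentum=0.9)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)  # L = L_S; add the center loss L_C here when it is used (line 12)
opt.zero_grad()
loss.backward()
opt.step()

In the actual algorithm, the gradients with respect to C and W follow the closed-form rules of Eqs. 3.49-3.56, whereas this sketch simply relies on autograd with the straight-through estimator.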

3.5.8 Ablation Study

Parameter As mentioned above, the proposed projection loss, similar to clustering, can control the quantization process. We computed the distributions of the full-precision kernels and visualized the results in Figs. 3.14 and 3.15. The hyperparameter λ is designed to balance the projection loss and the cross-entropy loss. We vary it from 1e−3 to 1e−5 and finally set it to 0 in Fig. 3.14, where the variance of the distribution increases with λ. When λ = 0, only one cluster is obtained, with the kernel weights tightly distributed around the threshold of 0. This can cause instability during binarization, because even a little noise may flip a positive weight to negative and vice versa.

Fig. 3.15 further shows how the distribution evolves under the projection loss during training. A natural question is: do we always need a large λ? Since this is a discrete optimization problem, the answer is no, and the experiment in Table 3.4 verifies it: the projection loss and the cross-entropy loss should be balanced rather than letting either term dominate. For example, when λ is set to 1e−4, the accuracy is higher than with the other values. Thus, we fix λ to 1e−4 in the following experiments.

Learning convergence For PCNN-22 in Table 3.2, the PCNN model is trained for 200 epochs and then used to perform inference. In Fig. 3.16, we plot the training and test losses with λ = 0 and λ = 1e−4, respectively. The curves clearly show that PCNNs with λ = 1e−4 (blue